Goto

Collaborating Authors

 concentration result


Concentration of General Stochastic Approximation Under Heavy-Tailed Markovian Noise

arXiv.org Machine Learning

We establish maximal concentration bounds for the iterates generated by stochastic approximation algorithms with general step sizes, where the noise has a finite-state Markovian component plus a Martingale-difference component. When the Martingale-difference noise is bounded, we show that the tail of the error can be sub-Gaussian, sub-Weibull, or something lighter than any Pareto but heavier than any Weibull, depending on the step size sequence and on whether the random operator is almost surely contractive, almost surely non-expansive, or expansive with positive probability. Our analysis relies on a novel Lyapunov function involving the moment-generating function of the solution to a Poisson equation, together with an auxiliary projected algorithm. We complement the upper bounds with worst-case examples showing that qualitatively sharper bounds are impossible. We further study the case of unbounded Martingale-difference noise when the average operator is contractive, and the step sizes are of order $1/k$. In this setting, we show that if the random operator is almost surely non-expansive, then the error tail is at most three times heavier than the noise tail, whereas if the random operator is expansive with positive probability, then the error may have substantially heavier tails. These results are obtained through a novel black-box truncation argument that reduces the unbounded-noise setting to the bounded-noise case.




XXXXX

Neural Information Processing Systems

There have been multiple recent approaches to obtain a near-optimal policy in CMDPs in the regret-minimization or PAC-RL settings [13, 38, 9, 19, 31, 22, 36, 12, 15, 16, 11].



Near-optimal Rank Adaptive Inference of High Dimensional Matrices

arXiv.org Machine Learning

We address the problem of estimating a high-dimensional matrix from linear measurements, with a focus on designing optimal rank-adaptive algorithms. These algorithms infer the matrix by estimating its singular values and the corresponding singular vectors up to an effective rank, adaptively determined based on the data. We establish instance-specific lower bounds for the sample complexity of such algorithms, uncovering fundamental trade-offs in selecting the effective rank: balancing the precision of estimating a subset of singular values against the approximation cost incurred for the remaining ones. Our analysis identifies how the optimal effective rank depends on the matrix being estimated, the sample size, and the noise level. We propose an algorithm that combines a Least-Squares estimator with a universal singular value thresholding procedure. We provide finite-sample error bounds for this algorithm and demonstrate that its performance nearly matches the derived fundamental limits. Our results rely on an enhanced analysis of matrix denoising methods based on singular value thresholding. We validate our findings with applications to multivariate regression and linear dynamical system identification.


A Some concentration results for uniform random variables

Neural Information Processing Systems

In this section, we present the proof of Theorem 3.3. In Section B.1, we provide the detail of the Section B.4 and completes the proof. B.2 Upper bound on T L). (26) The KKT condition suggests that the primal-dual optimal pair ( θ This section includes additional experiment results on applying ResMem to CIFAR100 dataset. In addition to the results already presented in Section 4.2, we also evaluate ResMem performance for Figure 4: Test(left)/Training (right) accuracy for different sample sizes. This section includes additional experiment results on applying ResMem to ImageNet dataset.



Taming Polysemanticity in LLMs: Provable Feature Recovery via Sparse Autoencoders

arXiv.org Machine Learning

We study the challenge of achieving theoretically grounded feature recovery using Sparse Autoencoders (SAEs) for the interpretation of Large Language Models. Existing SAE training algorithms often lack rigorous mathematical guarantees and suffer from practical limitations such as hyperparameter sensitivity and instability. To address these issues, we first propose a novel statistical framework for the feature recovery problem, which includes a new notion of feature identifiability by modeling polysemantic features as sparse mixtures of underlying monosemantic concepts. Building on this framework, we introduce a new SAE training algorithm based on ``bias adaptation'', a technique that adaptively adjusts neural network bias parameters to ensure appropriate activation sparsity. We theoretically \highlight{prove that this algorithm correctly recovers all monosemantic features} when input data is sampled from our proposed statistical model. Furthermore, we develop an improved empirical variant, Group Bias Adaptation (GBA), and \highlight{demonstrate its superior performance against benchmark methods when applied to LLMs with up to 1.5 billion parameters}. This work represents a foundational step in demystifying SAE training by providing the first SAE algorithm with theoretical recovery guarantees, thereby advancing the development of more transparent and trustworthy AI systems through enhanced mechanistic interpretability.


An invariance principle based concentration result for large-scale stochastic pairwise interaction network systems

arXiv.org Artificial Intelligence

We study stochastic pairwise interaction network systems whereby a finite population of agents, identified with the nodes of a graph, update their states in response to both individual mutations and pairwise interactions with their neighbors. The considered class of systems include the main epidemic models -such as the SIS, SIR, and SIRS models-, certain social dynamics models -such as the voter and anti-voter models-, as well as evolutionary dynamics on graphs. Since these stochastic systems fall into the class of finite-state Markov chains, they always admit stationary distributions. We analyze the asymptotic behavior of these stationary distributions in the limit as the population size grows large while the interaction network maintains certain mixing properties. Our approach relies on the use of Lyapunov-type functions to obtain concentration results on these stationary distributions. Notably, our results are not limited to fully mixed population models, as they do apply to a much broader spectrum of interaction network structures, including, e.g., Erd\"oos-R\'enyi random graphs.